Team - Triston Hudgins, Shijo Joseph, Osman Kanteh, Douglas Yip
## Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import scatter
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, layout, XAxis, YAxis, Bar, Line
We selected the Lahman baseball dataset to understand the difficult decisions a Major League Baseball (MLB) General Manager (GM) faces in fielding a competitive, talented team while staying below the team's salary cap. Salary caps in baseball exist to reduce anti-competitive behavior in the league, creating guardrails and fairness in how contracts are offered to players. Teams that choose to spend more than the salary cap are penalized with the "Competitive Balance Tax" (CBT): a team is assessed a 20% tax for its first season above the salary cap, and the rate becomes more punitive for every consecutive year it stays above it.
A player's performance is usually rewarded with contracts of varying size. In this analysis, we will see how a player's offensive stats can predict that player's salary. We will categorize the salaries from Low to Elite to measure the effectiveness of our LDA prediction model.
Sources
- http://origin.mlb.com/glossary/transactions/competitive-balance-tax
- https://bleacherreport.com/articles/32306-open-mic-why-baseball-gms-have-the-most-difficult-job
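The escalating CBT penalty described above can be sketched as simple arithmetic. This is an illustrative example only; the exact tier rates and the threshold vary by Collective Bargaining Agreement, so the numbers below are assumptions:

```python
# Illustrative sketch of how the Competitive Balance Tax escalates.
# Tier rates here are assumptions for illustration, not exact CBA figures.
def cbt_penalty(payroll, threshold, consecutive_years_over):
    """Tax owed on the portion of payroll above the threshold."""
    overage = max(0.0, payroll - threshold)
    rates = {1: 0.20, 2: 0.30}                      # first and second season over
    rate = rates.get(consecutive_years_over, 0.50)  # third year and beyond
    return overage * rate

# A team $10M over the threshold for the first time pays 20% of the overage:
print(cbt_penalty(220_000_000, 210_000_000, 1))  # 2000000.0
```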
# load the Lahman baseball Batting dataset
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331_lab1/main/Batting.csv') # read in the csv file
df = df[(df['yearID'] >= 2000) & (df['yearID'] <= 2015)]
df.head()
| | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 79265 | abbotje01 | 2000 | 1 | CHA | AL | 80 | 215 | 31 | 59 | 15 | ... | 29.0 | 2.0 | 1.0 | 21 | 38.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 |
| 79266 | abbotku01 | 2000 | 1 | NYN | NL | 79 | 157 | 22 | 34 | 7 | ... | 12.0 | 1.0 | 1.0 | 14 | 51.0 | 2.0 | 1.0 | 0.0 | 1.0 | 2.0 |
| 79267 | abbotpa01 | 2000 | 1 | SEA | AL | 35 | 5 | 1 | 2 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 79268 | abreubo01 | 2000 | 1 | PHI | NL | 154 | 576 | 103 | 182 | 42 | ... | 79.0 | 28.0 | 8.0 | 100 | 116.0 | 9.0 | 1.0 | 0.0 | 3.0 | 12.0 |
| 79269 | aceveju01 | 2000 | 1 | MIL | NL | 62 | 1 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 22 columns
# load the Lahman baseball Salaries dataset
df_salary = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331_lab1/main/Salaries.csv') # read in the csv file
df_salary = df_salary[(df_salary['yearID'] >= 2000) & (df_salary['yearID'] <= 2015)]
df_salary.head()
| | yearID | teamID | lgID | playerID | salary |
|---|---|---|---|---|---|
| 12263 | 2000 | ANA | AL | anderga01 | 3250000 |
| 12264 | 2000 | ANA | AL | belchti01 | 4600000 |
| 12265 | 2000 | ANA | AL | botteke01 | 4000000 |
| 12266 | 2000 | ANA | AL | clemeed02 | 215000 |
| 12267 | 2000 | ANA | AL | colanmi01 | 200000 |
Merging the two datasets gives us a fuller picture of each player's salary alongside their batting stats.
df = pd.merge(df,df_salary[['playerID','yearID','teamID','salary']],on=['playerID','yearID','teamID'], how='left')
df.head()
| | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abbotje01 | 2000 | 1 | CHA | AL | 80 | 215 | 31 | 59 | 15 | ... | 2.0 | 1.0 | 21 | 38.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 255000.0 |
| 1 | abbotku01 | 2000 | 1 | NYN | NL | 79 | 157 | 22 | 34 | 7 | ... | 1.0 | 1.0 | 14 | 51.0 | 2.0 | 1.0 | 0.0 | 1.0 | 2.0 | 500000.0 |
| 2 | abbotpa01 | 2000 | 1 | SEA | AL | 35 | 5 | 1 | 2 | 1 | ... | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 285000.0 |
| 3 | abreubo01 | 2000 | 1 | PHI | NL | 154 | 576 | 103 | 182 | 42 | ... | 28.0 | 8.0 | 100 | 116.0 | 9.0 | 1.0 | 0.0 | 3.0 | 12.0 | 2933333.0 |
| 4 | aceveju01 | 2000 | 1 | MIL | NL | 62 | 1 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 612500.0 |
5 rows × 23 columns
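As a side note on verifying the join above, pandas' `indicator` flag (not used in the original cell, added here as a hypothetical check) reports how many batting rows actually found a salary match. A minimal sketch on toy data:

```python
import pandas as pd

# Toy frames mirroring the Batting/Salaries join keys
batting = pd.DataFrame({'playerID': ['a', 'b', 'c'],
                        'yearID':   [2000, 2000, 2001],
                        'teamID':   ['NYN', 'SEA', 'PHI'],
                        'H':        [59, 34, 182]})
salaries = pd.DataFrame({'playerID': ['a', 'c'],
                         'yearID':   [2000, 2001],
                         'teamID':   ['NYN', 'PHI'],
                         'salary':   [255000, 2933333]})

merged = pd.merge(batting, salaries,
                  on=['playerID', 'yearID', 'teamID'],
                  how='left', indicator=True)
print(merged['_merge'].value_counts())
# 'left_only' rows are batters with no salary record (salary is NaN)
```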
#following code will describe the data
df.describe()
| | yearID | stint | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 22083.000000 | 1.284700e+04 |
| mean | 2007.608251 | 1.086854 | 50.089888 | 120.486302 | 16.074718 | 31.480822 | 6.287778 | 0.656478 | 3.622832 | 15.303446 | 2.055291 | 0.825341 | 11.353892 | 24.252276 | 0.889236 | 1.219309 | 1.123625 | 0.971109 | 2.738532 | 3.073175e+06 |
| std | 4.632573 | 0.297273 | 45.772948 | 180.721109 | 26.868157 | 50.531427 | 10.524455 | 1.582814 | 7.451858 | 26.456559 | 5.786324 | 1.977861 | 20.081184 | 35.650176 | 2.697024 | 2.641271 | 2.326378 | 1.890580 | 4.750978 | 4.181921e+06 |
| min | 2000.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.655740e+05 |
| 25% | 2004.000000 | 1.000000 | 13.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.293000e+05 |
| 50% | 2008.000000 | 1.000000 | 33.000000 | 18.000000 | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.015000e+06 |
| 75% | 2012.000000 | 1.000000 | 73.000000 | 178.000000 | 21.000000 | 44.000000 | 8.000000 | 1.000000 | 3.000000 | 19.000000 | 1.000000 | 1.000000 | 14.000000 | 36.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 4.000000e+06 |
| max | 2015.000000 | 4.000000 | 163.000000 | 716.000000 | 152.000000 | 262.000000 | 59.000000 | 23.000000 | 73.000000 | 160.000000 | 78.000000 | 24.000000 | 232.000000 | 223.000000 | 120.000000 | 30.000000 | 24.000000 | 16.000000 | 32.000000 | 3.300000e+07 |
Across the 2000-2015 seasons we selected, we collected over 22,000 rows of players' offensive stats. The interesting element of these stats is the extremely wide gap between the 75th percentile and the maximum. For example, 75% of players had 19 or fewer RBIs, but the maximum is 160 RBIs; 75% of players had 44 or fewer hits, yet the maximum is 262 hits in a season. For salary, 75% of players made up to $4 million, while the maximum is $33 million for one season. These stats tell us that baseball is not an easy sport: to achieve an annual salary in the top 25th percentile ($4-33 million per year), a player must at least outperform the 75th percentile of players to earn a high-paying contract.
print (df.dtypes)
print (df.info())
playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
salary      float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22083 entries, 0 to 22082
Data columns (total 23 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   playerID  22083 non-null  object
 1   yearID    22083 non-null  int64
 2   stint     22083 non-null  int64
 3   teamID    22083 non-null  object
 4   lgID      22083 non-null  object
 5   G         22083 non-null  int64
 6   AB        22083 non-null  int64
 7   R         22083 non-null  int64
 8   H         22083 non-null  int64
 9   2B        22083 non-null  int64
 10  3B        22083 non-null  int64
 11  HR        22083 non-null  int64
 12  RBI       22083 non-null  float64
 13  SB        22083 non-null  float64
 14  CS        22083 non-null  float64
 15  BB        22083 non-null  int64
 16  SO        22083 non-null  float64
 17  IBB       22083 non-null  float64
 18  HBP       22083 non-null  float64
 19  SH        22083 non-null  float64
 20  SF        22083 non-null  float64
 21  GIDP      22083 non-null  float64
 22  salary    12847 non-null  float64
dtypes: float64(10), int64(10), object(3)
memory usage: 4.0+ MB
None
A total of over 22,000 offensive player-seasons were recorded in the dataset from 2000 - 2015. The following full-season statistics (columns of continuous variables) will be used:
- Salary: the player's earnings in the season

Categorical variables are defined in the dataset but will not be used in the analysis.
display(df.shape)
(22083, 23)
The dataset has 22,083 rows and 23 columns
#plotting the games vs. salary to gain an understanding of the data
#df.plot(kind="scatter",x="salary",y="G")
px.scatter(df,
x="salary", y="G",
title= "Salary by Number of Games Played",
labels={'salary': 'Salary',
'G': 'Number of Games'})
#plotting salary vs player - Difficult to read. Long load time. Is it necessary?
#df.salary.plot.bar()
#check for NA
df.isnull().sum()
playerID       0
yearID         0
stint          0
teamID         0
lgID           0
G              0
AB             0
R              0
H              0
2B             0
3B             0
HR             0
RBI            0
SB             0
CS             0
BB             0
SO             0
IBB            0
HBP            0
SH             0
SF             0
GIDP           0
salary      9236
dtype: int64
Salaries are null for 9236 records.
# Any missing values in the dataset
def plot_missingness(df: pd.DataFrame = df) -> None:
    nan_df = pd.DataFrame(df.isna().sum()).reset_index()
    nan_df.columns = ['Column', 'NaN_Count']
    nan_df['NaN_Count'] = nan_df['NaN_Count'].astype('int')
    nan_df['NaN_%'] = round(nan_df['NaN_Count'] / df.shape[0] * 100, 1)
    nan_df['Type'] = 'Missingness'
    nan_df.sort_values('NaN_%', inplace=True)
    # Add completeness
    for i in range(nan_df.shape[0]):
        complete_df = pd.DataFrame([nan_df.loc[i, 'Column'],
                                    df.shape[0] - nan_df.loc[i, 'NaN_Count'],
                                    100 - nan_df.loc[i, 'NaN_%'],
                                    'Completeness']).T
        complete_df.columns = ['Column', 'NaN_Count', 'NaN_%', 'Type']
        complete_df['NaN_%'] = complete_df['NaN_%'].astype('int')
        complete_df['NaN_Count'] = complete_df['NaN_Count'].astype('int')
        nan_df = pd.concat([nan_df, complete_df], sort=True)
    nan_df = nan_df.rename(columns={"Column": "Feature", "NaN_%": "Missing %"})
    # Missingness plot
    fig = px.bar(nan_df,
                 x='Feature',
                 y='Missing %',
                 title=f"Missingness Plot (N={df.shape[0]})",
                 color='Type',
                 opacity=0.6,
                 color_discrete_sequence=['red', '#808080'],
                 width=800,
                 height=400)
    fig.show()
plot_missingness(df)
We want to see what type of players don't have salary details.
null_data = df[df.isnull().any(axis=1)]
display(null_data)
ax = null_data.boxplot(column=['G', 'AB','H'])
ax.set_yscale('log')
| | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | alcanis01 | 2000 | 1 | BOS | AL | 21 | 45 | 9 | 13 | 1 | ... | 0.0 | 0.0 | 3 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 14 | allench01 | 2000 | 1 | MIN | AL | 15 | 50 | 2 | 15 | 3 | ... | 0.0 | 2.0 | 3 | 14.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | NaN |
| 15 | allendu01 | 2000 | 1 | SDN | NL | 9 | 12 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 2 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN |
| 16 | allendu01 | 2000 | 2 | DET | AL | 18 | 16 | 5 | 7 | 2 | ... | 0.0 | 0.0 | 2 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 22 | alvarcl01 | 2000 | 1 | PHI | NL | 2 | 5 | 1 | 1 | 0 | ... | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22069 | ynoara01 | 2015 | 1 | COL | NL | 72 | 127 | 14 | 33 | 8 | ... | 1.0 | 0.0 | 3 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | NaN |
| 22074 | younger03 | 2015 | 2 | NYN | NL | 18 | 8 | 9 | 0 | 0 | ... | 3.0 | 2.0 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN |
| 22078 | zitoba01 | 2015 | 1 | OAK | AL | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 22080 | zobribe01 | 2015 | 2 | KCA | AL | 59 | 232 | 37 | 66 | 16 | ... | 2.0 | 3.0 | 29 | 30.0 | 1.0 | 1.0 | 0.0 | 2.0 | 3.0 | NaN |
| 22082 | zychto01 | 2015 | 1 | SEA | AL | 13 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
9236 rows × 23 columns
Based on the boxplot, 75% of the players with no salary played fewer than 75 games, had fewer than 75 at bats, or had fewer than 10 hits. Although the rest may have significant playing time, we cannot impute the missing values, since contract salaries must be manually recorded. For this project, we will therefore remove these rows.
print("Number of Rows before removing:", len(df))
df_clean = df.dropna()
print("Total number of rows after removing the rows with missing values:",len(df_clean))
Number of Rows before removing: 22083
Total number of rows after removing the rows with missing values: 12847
df_clean.dtypes
playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
salary      float64
dtype: object
# observed that some player IDs had multiple entries (stints) in the same year, so combine them
df_clean = df_clean.groupby(['playerID', 'yearID'], as_index=False).sum()
print (df_clean)
playerID yearID stint G AB R H 2B 3B HR ... SB CS \
0 aardsda01 2004 1 11 0 0 0 0 0 0 ... 0.0 0.0
1 aardsda01 2007 1 25 0 0 0 0 0 0 ... 0.0 0.0
2 aardsda01 2008 1 47 1 0 0 0 0 0 ... 0.0 0.0
3 aardsda01 2009 1 73 0 0 0 0 0 0 ... 0.0 0.0
4 aardsda01 2010 1 53 0 0 0 0 0 0 ... 0.0 0.0
... ... ... ... ... ... .. .. .. .. .. ... ... ...
12838 zumayjo01 2008 1 21 0 0 0 0 0 0 ... 0.0 0.0
12839 zumayjo01 2009 1 29 0 0 0 0 0 0 ... 0.0 0.0
12840 zumayjo01 2010 1 31 0 0 0 0 0 0 ... 0.0 0.0
12841 zuninmi01 2014 1 131 438 51 87 20 2 22 ... 0.0 3.0
12842 zuninmi01 2015 1 112 350 28 61 11 0 11 ... 0.0 1.0
BB SO IBB HBP SH SF GIDP salary
0 0 0.0 0.0 0.0 0.0 0.0 0.0 300000.0
1 0 0.0 0.0 0.0 0.0 0.0 0.0 387500.0
2 0 1.0 0.0 0.0 0.0 0.0 0.0 403250.0
3 0 0.0 0.0 0.0 0.0 0.0 0.0 419000.0
4 0 0.0 0.0 0.0 0.0 0.0 0.0 2750000.0
... .. ... ... ... ... ... ... ...
12838 0 0.0 0.0 0.0 0.0 0.0 0.0 420000.0
12839 0 0.0 0.0 0.0 0.0 0.0 0.0 735000.0
12840 0 0.0 0.0 0.0 0.0 0.0 0.0 915000.0
12841 17 158.0 1.0 17.0 0.0 4.0 12.0 504100.0
12842 21 132.0 0.0 5.0 8.0 2.0 6.0 523500.0
[12843 rows x 21 columns]
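On toy data, the effect of `groupby(['playerID', 'yearID']).sum()` on a traded player's two stints looks like this (note that summing also adds the partial-season salaries into one figure):

```python
import pandas as pd

# Toy multi-stint data: allendu01 appears twice in 2000 (traded mid-season)
stints = pd.DataFrame({'playerID': ['allendu01', 'allendu01', 'abreubo01'],
                       'yearID':   [2000, 2000, 2000],
                       'H':        [0, 7, 182],
                       'salary':   [200000.0, 250000.0, 2933333.0]})

# Collapse the stints into one row per player-season by summing the stats
season = stints.groupby(['playerID', 'yearID'], as_index=False).sum()
print(season)
# allendu01's two stints become one row with H = 7 and salary = 450000.0
```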
#reviewing the years included in the dataset
df_clean.yearID.value_counts()
2015    814
2001    814
2008    812
2011    812
2012    811
2013    809
2007    806
2002    804
2014    801
2004    800
2003    799
2000    798
2005    796
2010    790
2006    790
2009    787
Name: yearID, dtype: int64
player_counts = []
for playerID in df_clean['playerID']:
    player_counts.append(list(df_clean['playerID']).count(playerID))
print('The average number of times a playerID appears is {:0.4f}'.format(np.mean(player_counts)))
The average number of times a playerID appears is 6.8127
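The counting loop above is quadratic in the number of rows; an equivalent vectorized form (a sketch, not part of the original notebook) maps each row to its playerID's total count and averages:

```python
import pandas as pd
import numpy as np

ids = pd.Series(['a', 'a', 'b', 'a', 'c', 'b'])

# Loop version, as in the cell above: O(n^2)
loop_mean = np.mean([list(ids).count(p) for p in ids])

# Vectorized: map each row to its value's total count, then average
vec_mean = ids.map(ids.value_counts()).mean()

assert loop_mean == vec_mean
print(vec_mean)  # (3+3+2+3+1+2)/6 ≈ 2.3333
```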
df_clean.duplicated()
0 False
1 False
2 False
3 False
4 False
...
12838 False
12839 False
12840 False
12841 False
12842 False
Length: 12843, dtype: bool
# Drop the non-beneficial ID column 'playerID'.
df_clean = df_clean.drop(columns=['playerID'])
df_clean.head()
| | yearID | stint | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 300000.0 |
| 1 | 2007 | 1 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 387500.0 |
| 2 | 2008 | 1 | 47 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 403250.0 |
| 3 | 2009 | 1 | 73 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 419000.0 |
| 4 | 2010 | 1 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2750000.0 |
# For detecting outliers we will use LocalOutlierFactor with its default values, n_neighbors=20 and contamination='auto'.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
clf=LocalOutlierFactor(n_neighbors=20, contamination='auto')
clf.fit_predict(df_clean)
df_scores=clf.negative_outlier_factor_
df_scores= np.sort(df_scores)
df_scores[0:20]
array([-2553.55270262, -2104.40459025, -2004.77421899, -1949.86900436,
-1861.75941593, -1634.28817647, -1487.24792154, -1392.28149387,
-1373.94061793, -1362.07828775, -1354.64948962, -1335.16228368,
-1266.08438327, -1241.66885913, -1224.46540826, -1212.63352355,
-1209.25294608, -1206.24049872, -1198.70758681, -1157.01770245])
#sns.boxplot(df_scores);
px.box(df_scores)
threshold=np.sort(df_scores)[5]
print(threshold)
df_clean = df_clean.loc[df_scores > threshold]
df_clean = df_clean.reset_index(drop=True)
-1634.288176471475
df_clean.shape
(12837, 20)
We bin salaries into four categories:
- Low ($0 - $1,999,999)
- Medium ($2,000,000 - $5,999,999)
- High ($6,000,000 - $14,999,999)
- Elite ($15,000,000+)
df_clean['salary_cut'] = pd.cut(df_clean['salary'], bins = [0,1999999,5999999,14999999,50000000], labels=["Low", "Medium", "High", "Elite"], right=True)
df_clean['salary_cut_Numeric'] = pd.cut(df_clean['salary'], bins = [0,1999999,5999999,14999999,50000000], labels=[0, 1, 2, 3], right=True)
df_clean.head()
| | yearID | stint | G | AB | R | H | 2B | 3B | HR | RBI | ... | BB | SO | IBB | HBP | SH | SF | GIDP | salary | salary_cut | salary_cut_Numeric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | 1 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 327000.0 | Low | 0 |
| 1 | 2011 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 418000.0 | Low | 0 |
| 2 | 2012 | 1 | 37 | 7 | 0 | 1 | 0 | 0 | 0 | 0.0 | ... | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 485000.0 | Low | 0 |
| 3 | 2014 | 1 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 525900.0 | Low | 0 |
| 4 | 2015 | 1 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1087500.0 | Low | 0 |
5 rows × 22 columns
- On Base Percentage (OBP): how often a player reaches base per plate appearance
- Slugging Percentage (SLG): total bases per at bat, a measure of the quality of hits
#add new columns: OBP and SLG, guarding against division by zero
# OBP = (H + BB + HBP) / (AB + BB + HBP + SF); BB already includes IBB
df_clean["OBP"] = np.where((df_clean["AB"] + df_clean["BB"] + df_clean["HBP"] + df_clean["SF"]) != 0,
                           (df_clean["H"] + df_clean["BB"] + df_clean["HBP"]) / (df_clean["AB"] + df_clean["BB"] + df_clean["HBP"] + df_clean["SF"]),
                           0)
# SLG = total bases / AB; H already counts each hit once, so extra-base hits
# add only their extra bases (2B +1, 3B +2, HR +3)
df_clean["SLG"] = np.where(df_clean["AB"] != 0,
                           (df_clean["H"] + df_clean["2B"] + df_clean["3B"]*2 + df_clean["HR"]*3) / df_clean["AB"],
                           0)
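A quick sanity check of the standard rate formulas on a made-up stat line (the numbers are invented for illustration): OBP = (H + BB + HBP) / (AB + BB + HBP + SF), and SLG = total bases / AB, which caps SLG at 4.000.

```python
# Hypothetical stat line: 400 AB, 100 H (of which 20 2B, 5 3B, 10 HR),
# 40 BB, 5 HBP, 5 SF
AB, H, _2B, _3B, HR, BB, HBP, SF = 400, 100, 20, 5, 10, 40, 5, 5

obp = (H + BB + HBP) / (AB + BB + HBP + SF)
total_bases = H + _2B + 2 * _3B + 3 * HR   # H already counts each hit once
slg = total_bases / AB

print(round(obp, 3))  # 0.322
print(round(slg, 3))  # 0.4
```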
df_clean.head()
| | yearID | stint | G | AB | R | H | 2B | 3B | HR | RBI | ... | IBB | HBP | SH | SF | GIDP | salary | salary_cut | salary_cut_Numeric | OBP | SLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | 1 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 327000.0 | Low | 0 | 0.400000 | 0.000000 |
| 1 | 2011 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 418000.0 | Low | 0 | 0.000000 | 0.000000 |
| 2 | 2012 | 1 | 37 | 7 | 0 | 1 | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 485000.0 | Low | 0 | 0.142857 | 0.142857 |
| 3 | 2014 | 1 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 525900.0 | Low | 0 | 0.000000 | 0.000000 |
| 4 | 2015 | 1 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1087500.0 | Low | 0 | 0.000000 | 0.000000 |
5 rows × 24 columns
df_clean.max()
yearID                    2015
stint                        5
G                          163
AB                         716
R                          152
H                          262
2B                          59
3B                          23
HR                          73
RBI                      160.0
SB                        78.0
CS                        24.0
BB                         232
SO                       223.0
IBB                      120.0
HBP                       30.0
SH                        24.0
SF                        16.0
GIDP                      32.0
salary              33000000.0
salary_cut               Elite
salary_cut_Numeric           3
OBP                        1.0
SLG                        5.0
dtype: object
The following boxplot shows the salary distribution for the clean dataframe.
# Create a histogram of salaries.
#plt.hist((df_clean['salary']/1e6), bins=6, color='g', edgecolor='black', linewidth=1.2, align='mid');
#plt.xlabel('salary (millions of $)'), plt.ylabel('Count')
#plt.title('MLB Salary Distribution', size = 14);
#Plotly histogram of the salary distribution
px.histogram(df_clean['salary']/1e6, x= "salary",
nbins=20,
title = 'MLB Salary Distribution',
labels= {'salary': 'Salary (Millions of $)'})
The below boxplot shows that the median number of games played in the Elite salary group is about 48 games higher than in the Low salary group. This may suggest that experience or seniority has an effect on skill level and salary.
## G by salary cut
px.box(df_clean,
x="salary_cut", y="G",
color="salary_cut",
title = "Number of Games by Salary Cut",
labels={'salary_cut': 'Salary Cut',
'G': 'Number of Games'})
The below boxplot demonstrates OBP vs Salary Cut. It is shown that the median On Base Percentage slightly increases as salary increases, with the Low median equal to 0.31 and the Elite median equal to 0.34.
## OBP by salary cut
#sns.boxplot( x="salary_cut", y="OBP", data=df_clean).set(title = 'OBP by Salary Cut')
px.box(df_clean,
x="salary_cut", y="OBP",
color="salary_cut",
title = "On Base Percentage by Salary Cut",
labels={'salary_cut': 'Salary Cut',
'OBP': 'On Base Percentage (OBP)'})
The following boxplot shows a slight increase in median slugging percentage as salary increases. There also appear to be errors in all groups: slugging percentage cannot exceed 4.000 (a home run on every at bat), so values above that point to a calculation issue or bad records and should be explored prior to analysis.
## SLG by salary cut
#sns.boxplot(x="salary_cut", y="SLG", data=df_clean).set(title = 'SLG by Salary Cut')
px.box(df_clean,
x="salary_cut", y="SLG",
color="salary_cut",
title = "SLG by Salary Cut",
labels={'salary_cut': 'Salary Cut',
'SLG': 'Slugging Percentage'})
The below boxplot reflects the defined salary cut groupings and gives insight on each group.
## salary ranges by salary cut
#sns.boxplot(x="salary_cut", y="salary", data=df_clean).set(title = 'Salary by Salary Cut')
px.box(df_clean,
x="salary_cut", y="salary",
color="salary_cut",
title = "Salary by Salary Cut",
labels={'salary_cut': 'Salary Cut',
'salary': 'Salary'})
The scatterplot below demonstrates the relationship between salary and on base percentage. It shows that the OBP clusters get tighter as salary increases. The outliers and / or errors should be explored prior to any analysis.
## scatterplot ranges by salary cut
#sns.scatterplot(x="salary", y="OBP", hue="salary_cut", data=df_clean)
px.scatter(df_clean,
x="salary", y="OBP",
color="salary_cut",
title="On Base Percentage by Salary",
labels={'OBP': 'On Base Percentage',
'salary_cut': 'Salary Cut',
'salary': 'Salary'})
The following scatterplot shows the correlation between runs batted in and the number of homeruns. Filtering the plot shows a linear trend and roughly constant variance across all salary groupings.
## scatterplot ranges by salary cut
#sns.scatterplot(x="HR", y="RBI", hue="salary_cut", data=df_clean)
px.scatter(df_clean,
x="HR", y="RBI",
color="salary_cut",
title="Runs Batted In by Homeruns",
labels={'RBI': 'Runs Batted In',
'HR': 'Homeruns',
'salary_cut': 'Salary Cut'})
The next scatterplot shows the relationship between salary and slugging percentage. It behaves in a similar fashion to the On Base Percentage by Salary scatterplot in that the clustering gets tighter as the salary progresses.
## scatterplot ranges by salary cut
#sns.scatterplot(x="salary", y="SLG", data=df_clean)
px.scatter(df_clean,
x="salary", y="SLG",
color="salary_cut",
title="Slugging Percentage by Salary",
labels={'salary': 'Salary',
'SLG': "Slugging Percentage",
'salary_cut': 'Salary Cut'})
The below 3D plot shows the correlation between runs batted in, on base percentage, and number of hits. The size of each bubble represents the number of homeruns.
px.scatter_3d(df_clean,
x="RBI", y="OBP",z="H",
color="salary_cut",
size="HR",
title="Runs Batted In (RBI) vs On Base Percentage (OBP) vs Hits (H), Sized by Number of Homeruns",
labels= {'salary_cut': 'Salary Cut'})
We observe that there are players who have salaries but no offensive stats. We realized that pitchers are also included in this dataset and would be outliers in this analysis.
To ensure that we do not skew our analysis, we will focus on the salaries of offensive players. Since we do not have player position in the dataset, we will use OBP = 0 (no hitting statistics) as a proxy to assume that these players are pitchers.
# Delete rows where OBP is zero (assumed pitchers)
# This deletion is completed by "selecting" rows where OBP is non-zero
df_clean = df_clean.loc[df_clean["OBP"] != 0]
df_clean.shape
(8754, 24)
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(df_clean.corr(), cmap=cmap, annot=True)
f.tight_layout()
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
df_clean2 = df_clean.loc[:,~df_clean.columns.isin([ '2B', '3B', 'HR', 'AB'])]
sns.pairplot(df_clean2, hue="salary_cut", height=2)
<seaborn.axisgrid.PairGrid at 0x18f42db8dc8>
Based on both the correlation plot and the scatter plots, we clearly see the following:
- Hits are highly correlated with doubles, triples, HRs, and RBIs. This makes sense, since those are classifications of the type of hit. In the scatter plots the observations trend positive, and the correlation table shows >0.8 for many of these offensive hitting stats.
- Lower-salary players tend to have limited offensive stats, with distributions that skew right more than the other groups.
- We do not see good separation between a player's hitting stats and salary. Fielding statistics are other variables we did not consider that may better explain why a player receives a higher salary.
column = ['H', 'SLG', 'OBP', 'HR', 'RBI']
for col in column:
    plt.subplots(figsize=(20, 8))
    # violinplot does not accept kind/height/ci; those are catplot arguments
    sns.violinplot(x="salary_cut", y=col, data=df_clean, palette='PRGn')
The first graph shows the distribution of hits by salary cut. Low-salary players have few or no hits, with the distribution leaning toward 0. As you move up the salary cuts, the hit distribution becomes more bimodal, with a secondary peak near 175.
The second graph shows the distribution of SLG by salary cut. High- and Elite-paid players have very similar SLG distributions; other variables, such as fielding statistics that were not part of this analysis, could further differentiate these players.
The third graph shows the distribution of OBP by salary cut. Again, High- and Elite-paid players have very similar OBP distributions, and fielding statistics outside this analysis might further differentiate them.
The fourth graph shows the distribution of homeruns by salary cut. Low-salary players have few or no homeruns, with the distribution leaning toward 0. As you move up the salary cuts, the homerun distribution becomes more bimodal, with a secondary peak near 30.
The fifth graph shows the distribution of RBIs by salary cut. Low-salary players have few or no RBIs, with the distribution leaning toward 0. As you move up the salary cuts, the RBI distributions get fatter in the 50 to 100 range.
# Here we will use PCA for dimensionality reduction.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import numpy
import matplotlib.pyplot as plot
# You must normalize the data before applying the fit method
df_PCA = df_clean.loc[:,~df_clean.columns.isin(['playerID', 'stint', 'teamID','lgID','salary','salary_cut', 'salary_cut_Numeric', '2B', '3B', 'HR', 'AB'])]
df_PCA_normalized = (df_PCA - df_PCA.mean())/ df_PCA.std()
pca = PCA(n_components=df_PCA.shape[1])
pca.fit(df_PCA_normalized)
PCA(n_components=16)
# Reformat and view results
loadings = pd.DataFrame(pca.components_.T,
columns=['PC%s' % _ for _ in range(len(df_PCA_normalized.columns))],
index=df_PCA.columns)
print(loadings)
PC0 PC1 PC2 PC3 PC4 PC5 PC6 \
yearID -0.008901 -0.040801 -0.425967 0.858662 -0.048549 0.232788 0.025705
G 0.324420 -0.059554 -0.060321 -0.008147 -0.094698 -0.001422 -0.088874
R 0.334186 -0.065055 -0.009104 -0.026807 0.024623 -0.023152 -0.013108
H 0.333427 -0.067219 -0.048140 -0.013319 -0.043404 -0.021461 -0.103177
RBI 0.325746 0.096944 -0.129968 -0.083618 -0.004067 -0.021158 -0.083931
SB 0.187306 -0.460372 0.328737 0.211633 0.269830 -0.149503 0.020630
CS 0.205413 -0.441717 0.312554 0.157470 0.195548 -0.169695 0.000791
BB 0.310347 0.060454 -0.059520 -0.050941 0.208172 0.110831 0.090298
SO 0.297771 -0.038630 -0.177738 0.069143 -0.051413 -0.055656 0.025003
IBB 0.207870 0.229141 -0.110264 -0.145873 0.633092 0.458901 0.301060
HBP 0.227067 -0.013720 -0.060987 -0.052553 -0.427354 -0.197095 0.810689
SH -0.032789 -0.539919 -0.004463 -0.241925 -0.340085 0.723632 -0.015357
SF 0.269900 0.042612 -0.161400 -0.112905 -0.106954 -0.058005 -0.346518
GIDP 0.277258 0.061816 -0.202465 -0.085261 -0.174564 -0.055060 -0.278578
OBP 0.147237 0.331524 0.551140 0.227766 -0.143847 0.244581 0.030975
SLG 0.190279 0.323551 0.409920 0.152711 -0.259698 0.196314 -0.132832
PC7 PC8 PC9 PC10 PC11 PC12 PC13 \
yearID -0.047531 0.070211 0.072329 -0.006102 0.082508 -0.045151 -0.010539
G -0.037211 -0.159018 -0.089627 -0.026453 -0.276570 -0.369297 0.723044
R 0.052173 -0.046977 -0.072740 0.192712 0.233202 -0.303457 -0.222097
H -0.048840 -0.093236 0.076436 0.126309 0.007354 -0.401241 -0.068613
RBI 0.077025 0.017960 -0.041650 0.079085 0.044112 -0.272401 -0.524190
SB -0.005245 0.118673 0.095730 0.609859 -0.166458 0.252253 0.030747
CS -0.023602 -0.006631 0.160584 -0.722041 0.115735 -0.053018 -0.085690
BB 0.060376 -0.080228 -0.349951 0.024729 0.700104 0.311951 0.273813
SO 0.349519 -0.192785 -0.492946 -0.167943 -0.490595 0.369833 -0.179065
IBB -0.034845 0.092351 0.277740 -0.081086 -0.265794 0.005514 0.003435
HBP -0.081712 0.158443 0.163049 0.000043 0.011717 0.060422 0.018270
SH 0.002425 0.005891 -0.036381 0.006191 0.018175 0.038798 -0.068027
SF -0.247870 0.793360 -0.050400 -0.103713 -0.056864 0.182372 0.059063
GIDP -0.281245 -0.452848 0.525610 0.006991 0.018547 0.436380 -0.033567
OBP -0.554420 -0.110210 -0.291875 -0.018781 -0.103187 0.026670 -0.121799
SLG 0.634906 0.143269 0.323125 -0.007684 0.055205 0.064670 0.092390
PC14 PC15
yearID -0.021943 -0.020074
G -0.264310 -0.172349
R 0.451996 -0.659179
H 0.416050 0.705050
RBI -0.697652 0.034812
SB -0.127462 0.028249
CS -0.041164 -0.013947
BB -0.091850 0.161530
SO 0.162019 0.028267
IBB 0.062995 -0.026456
HBP -0.017709 0.016580
SH -0.038237 -0.000385
SF 0.083584 -0.023196
GIDP 0.000561 -0.086900
OBP 0.000035 -0.006559
SLG 0.013560 0.008776
plt.plot(pca.explained_variance_ratio_)
plt.ylabel('Explained Variance')
plt.xlabel('Components')
plt.show()
It looks like we would need 3 principal components to explain 80% of the variance.
- PC0 is driven by hitting metrics such as hits, home runs, and RBIs.
- PC1 is driven by player speed, with stolen bases (SB), caught stealing (CS), and sacrifice hits (SH) carrying the largest weights.
- PC2 reflects the quality and quantity of a hitter's contact, since SLG and OBP carry the largest weights.
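The "3 components for 80%" read of the scree plot can be checked directly with a cumulative sum of the explained-variance ratios. A minimal sketch, using random standardized data as a hypothetical stand-in for the scaled batting matrix (the same two lines applied to the real `X` reproduce the scree-plot reading):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled batting matrix (500 players x 16 stats)
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 16)))

pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 0.80
n_components_80 = int(np.argmax(cum_var >= 0.80)) + 1
print(n_components_80)
```

On uncorrelated random data the count is naturally much higher than 3; the point is the recipe, not the number.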
Rerunning PCA and LDA with the target included to determine which components are needed. The rest of the analysis will use the 3 principal components identified above.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import scatter
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, layout, XAxis, YAxis, Bar, Line
%matplotlib inline
# removing unnecessary columns from the df
df_DR = df_clean.loc[:,~df_clean.columns.isin(['playerID', 'stint', 'teamID','lgID','salary_cut', 'salary', '2B', '3B', 'HR', 'AB'])]
# Set our target as the 'salary_cut_Numeric' column
target = df_DR['salary_cut_Numeric']
target_names='salary_cut_Numeric'
Y = target
# Delete the column of target from our table
df_DR = df_DR.drop("salary_cut_Numeric",axis=1)
X = df_DR.values
X = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit(X).transform(X) # fit data and then transform it
lda = LDA(n_components=3)
X_lda = lda.fit(X, Y).transform(X) # fit data and then transform it
# print the components
print ('pca:', pca.components_)
print ('lda:', lda.scalings_.T)
pca: [[-0.0089013 0.32441973 0.33418634 0.33342664 0.32574579 0.18730595 0.20541281 0.31034681 0.29777111 0.20787009 0.22706734 -0.03278896 0.2699004 0.27725781 0.14723717 0.19027917] [-0.04080066 -0.05955397 -0.0650549 -0.06721933 0.09694379 -0.46037161 -0.44171758 0.06045535 -0.03863086 0.22914062 -0.01371994 -0.53991885 0.04261204 0.06181653 0.33152366 0.32355106] [-0.42596662 -0.06032014 -0.00910467 -0.04814028 -0.12996987 0.32873751 0.31255342 -0.05951744 -0.17773947 -0.11026498 -0.06098641 -0.00446316 -0.16140007 -0.20246507 0.55113976 0.4099207 ]] lda: [[ 0.41780707 -1.61275672 0.11305793 0.71965124 0.76278146 0.08773866 -0.2566929 0.71432957 -0.24336354 0.2029815 0.01222968 0.19180684 0.003746 0.29643762 -0.05493118 -0.05489834] [-0.48394617 -0.64415856 -1.22405547 1.92064957 0.11561116 -0.14934072 0.2842146 0.50869695 -0.0476467 -0.68400953 -0.01215602 0.27954507 0.30701564 -0.02748849 0.0190746 -0.17976016] [ 0.42187378 0.49169227 -1.22036286 1.75266478 -0.7163932 -0.09239516 -0.08272393 -0.88521599 0.12664837 0.49132436 0.26225622 0.32058846 0.23082037 -0.03381534 0.18119239 -0.15068359]]
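Note that LDA can produce at most n_classes − 1 discriminant axes, so with four salary classes (Low, Medium, High, Elite) the requested `n_components=3` is exactly the maximum. A toy sketch with synthetic data (not the batting set) mirroring the shape of the problem:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Synthetic 4-class, 16-feature data mirroring the shape of the problem
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 16))
y_toy = rng.integers(0, 4, size=200)

# With 4 classes, LDA yields at most 4 - 1 = 3 discriminant axes
lda_toy = LDA(n_components=3).fit(X_toy, y_toy)
print(lda_toy.transform(X_toy).shape)  # (200, 3)
```

Asking for a fourth component here would raise an error, which is why PCA (capped only by the number of features) and LDA (capped by the number of classes) can disagree on how many axes are available.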
# this function definition just formats the weights into readable strings (from class notes).
def get_feature_names_from_weights(weights, names):
tmp_array = []
for comp in weights:
tmp_string = ''
for fidx,f in enumerate(names):
if fidx>0 and comp[fidx]>=0:
tmp_string+='+'
tmp_string += '%.2f*%s ' % (comp[fidx],f[:8])
tmp_array.append(tmp_string)
return tmp_array
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_DR.columns)
# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_pca[:,0], X_pca[:,1], c=Y)
plt.xlabel('Principal Component 1', fontsize=30)
plt.ylabel('Principal Component 2', fontsize=30)
plt.title('Principal Component Analysis 1', fontsize=50)
plt.tick_params(axis='both', which='major', labelsize=15)
plt.tick_params(axis='both', which='minor', labelsize=15)
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_DR.columns)
# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_pca[:,1], X_pca[:,2], c=Y)
plt.xlabel('Principal Component 2', fontsize=30)
plt.ylabel('Principal Component 3', fontsize=30)
plt.title('Principal Component Analysis 2', fontsize=50)
plt.tick_params(axis='both', which='major', labelsize=15)
plt.tick_params(axis='both', which='minor', labelsize=15)
print('\033[1m' + 'Principal Component 1: ' + '\033[0m' ,pca_weight_strings[0])
print('\033[1m' + '\nPrincipal Component 2: \n' + '\033[0m',pca_weight_strings[1])
print('\033[1m' + '\nPrincipal Component 3: \n' + '\033[0m',pca_weight_strings[2])
Principal Component 1: -0.01*yearID +0.32*G +0.33*R +0.33*H +0.33*RBI +0.19*SB +0.21*CS +0.31*BB +0.30*SO +0.21*IBB +0.23*HBP -0.03*SH +0.27*SF +0.28*GIDP +0.15*OBP +0.19*SLG
Principal Component 2: -0.04*yearID -0.06*G -0.07*R -0.07*H +0.10*RBI -0.46*SB -0.44*CS +0.06*BB -0.04*SO +0.23*IBB -0.01*HBP -0.54*SH +0.04*SF +0.06*GIDP +0.33*OBP +0.32*SLG
Principal Component 3: -0.43*yearID -0.06*G -0.01*R -0.05*H -0.13*RBI +0.33*SB +0.31*CS -0.06*BB -0.18*SO -0.11*IBB -0.06*HBP -0.00*SH -0.16*SF -0.20*GIDP +0.55*OBP +0.41*SLG
Principal components 1, 2, and 3 are each a linear combination of the features.
In principal component 1, games played, runs, hits, RBIs, and walks carry the largest weights, reflecting a player's raw offensive stats.
In principal component 2, stolen bases, caught stealing, intentional walks, and sacrifice hits carry the largest weights, reflecting player speed.
In principal component 3, as noted previously, OBP and SLG carry the largest weights, reflecting the quality and quantity of a player's hits.
Of the 3 components, component 1 is the most important for separating the salary classes (Low, Medium, High, and Elite) from one another. Our observation suggests that when PC1 is greater than 2.5, a player is likely to have a High (green) or Elite (yellow) salary (greater than $6 million per year). When the offensive stats captured in PC1 fall below 2.5, the player's salary will likely be in the Low to Medium range (less than $6 million per year).
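The 2.5 cutoff read off the scatter plot could be quantified by conditioning the class labels on the PC1 score. A minimal sketch with hypothetical scores and classes (0=Low, 1=Medium, 2=High, 3=Elite), not values from the batting data:

```python
import numpy as np

# Hypothetical PC1 scores and salary classes illustrating the cutoff check
pc1_scores = np.array([-1.2, -0.4, 0.5, 1.1, 2.8, 3.3, 4.1])
salary_class = np.array([0, 0, 1, 1, 2, 3, 3])

above_cutoff = salary_class[pc1_scores > 2.5]
# Share of High/Elite (class >= 2) among players beyond the cutoff
print(np.mean(above_cutoff >= 2))
```

With the real `X_pca[:, 0]` and `Y` in place of the toy arrays, this turns the visual impression into a checkable proportion.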
#LDA graphs
lda_weight_strings = get_feature_names_from_weights(lda.scalings_.T, df_DR.columns)
# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_lda[:,0], X_lda[:,1], c=Y)
plt.xlabel('Component 1', fontsize=30)
plt.ylabel('Component 2', fontsize=30)
plt.title('Linear Discriminant Analysis 1', fontsize=50)
plt.tick_params(axis='both', which='major', labelsize=20)
plt.tick_params(axis='both', which='minor', labelsize=20)
#LDA graphs
lda_weight_strings = get_feature_names_from_weights(lda.scalings_.T, df_DR.columns)
# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_lda[:,0], X_lda[:,2], c=Y)
plt.xlabel('Component 1', fontsize=30)
plt.ylabel('Component 3', fontsize=30)
plt.title('Linear Discriminant Analysis 2', fontsize=50)
plt.tick_params(axis='both', which='major', labelsize=20)
plt.tick_params(axis='both', which='minor', labelsize=20)
print('\033[1m' + 'Component 1: ' + '\033[0m' ,lda_weight_strings[0])
print('\033[1m' + '\nComponent 2: \n' + '\033[0m',lda_weight_strings[1])
print('\033[1m' + '\nComponent 3: \n' + '\033[0m',lda_weight_strings[2])
Component 1: 0.42*yearID -1.61*G +0.11*R +0.72*H +0.76*RBI +0.09*SB -0.26*CS +0.71*BB -0.24*SO +0.20*IBB +0.01*HBP +0.19*SH +0.00*SF +0.30*GIDP -0.05*OBP -0.05*SLG
Component 2: -0.48*yearID -0.64*G -1.22*R +1.92*H +0.12*RBI -0.15*SB +0.28*CS +0.51*BB -0.05*SO -0.68*IBB -0.01*HBP +0.28*SH +0.31*SF -0.03*GIDP +0.02*OBP -0.18*SLG
Component 3: 0.42*yearID +0.49*G -1.22*R +1.75*H -0.72*RBI -0.09*SB -0.08*CS -0.89*BB +0.13*SO +0.49*IBB +0.26*HBP +0.32*SH +0.23*SF -0.03*GIDP +0.18*OBP -0.15*SLG
In component 1, games played, hits, RBIs, and walks carry the largest weights, reflecting a player's raw offensive stats.
In component 2, stolen bases, caught stealing, intentional walks, and sacrifice hits carry the largest weights, reflecting player speed.
In component 3, as noted previously, OBP and SLG carry the largest weights, reflecting the quality and quantity of a player's hits.
Based on the LDA, we note that the prediction of Elite salaries relies on raw offensive statistics and the quality of hits. When component 1 is greater than 2.5 and component 3 is greater than 0, we observe that an offensive player tends to obtain a salary in the High to Elite classes.
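The class separation visible in the LDA scatter plots can also be sanity-checked numerically with `lda.score`, which reports mean classification accuracy. A toy sketch using well-separated synthetic clusters standing in for the four salary classes (not the batting data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Four well-separated 2-D clusters standing in for the four salary classes
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X_toy = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])
y_toy = np.repeat([0, 1, 2, 3], 50)

lda_toy = LDA().fit(X_toy, y_toy)
# Mean accuracy on the clusters; near 1.0 when separation is clean
print(lda_toy.score(X_toy, y_toy))
```

On the real batting data the score would be lower, since the salary classes overlap visibly in the plots; comparing it against the toy ceiling gives a feel for how much of the separation the eye can trust.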
In conclusion, for offensive players to obtain the best salary from a team, a player needs to perform at or above the 75th percentile of all players' stats to be considered by team GMs for large salary contracts.
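The 75th-percentile benchmark can be computed per stat with `DataFrame.quantile`. A minimal sketch with made-up numbers (with `df_clean`, the same call would give each stat's actual benchmark):

```python
import pandas as pd

# Hypothetical mini-sample of counting stats for five players
stats = pd.DataFrame({'H':   [50, 80, 120, 150, 170],
                      'HR':  [2, 8, 15, 25, 40],
                      'RBI': [20, 35, 60, 80, 110]})

# 75th-percentile benchmark per stat
print(stats.quantile(0.75))
```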
Opportunities for future analysis improvement
- Include player positions so pitchers can be omitted from this analysis when determining the salary class of offensive players.
- Include player fielding statistics in combination with offensive statistics.